# alha15 1. FPGA accelerated hadoop cluster for deep learning computations
- map-reduce
- data parallelism
  - common compute step
- deep learning kernels computationally intensive
- speedup:          12.6 times
- energy reduction: 87.5%
- nodes:            6
- fpga:             zedboard
"Researchers are currently training deep learning architecture particularly convolutional neural networks either by resorting to special hardware such as GPU, FPGA or mapping training process into distributed computing cluster such as hadoop map-reduce framework, but not both at the same time"
"amdahl's law and the assumption that the acceleration speedup of all convolution layers to be the same, we computed the overall speedup from FPGA acceleration over sequential execution to be 8.12 times"

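The overall-speedup estimate quoted in this entry rests on Amdahl's law. A minimal sketch of that formula follows; the accelerated fraction `p` and the kernel speedup `s` are hypothetical values chosen for illustration, since these notes do not record the ones used in the paper.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the runtime is accelerated by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    # The untouched part (1 - p) runs at the old speed, the rest is s times faster.
    return 1.0 / ((1.0 - p) + p / s)


# Hypothetical numbers, not taken from the paper.
print(amdahl_speedup(p=0.9, s=10.0))
```

The sequential remainder `1 - p` bounds the result: even an infinitely fast accelerator cannot push the overall speedup past `1 / (1 - p)`.
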
# fisc20 2. BNNsplit binarized neural networks for embedded distributed FPGA-based computing systems
- map-reduce
- CNNs: floating point operations -> reducing weights to binary values for FPGAs
- FINN: state of the art BNN
- extension to run BNN on multi-FPGA systems
- hadoop: input data in the distributed file system is divided between mappers for fault tolerance; the master divides the input into splits and assigns each split to a mapper node, usually based on physical proximity
  - map: key-value pairs are mapped into new key-value pairs, which the framework groups by the output (second) key
  - reduce: applied in parallel to each group
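
The map, group and reduce steps above can be sketched in a few lines. This is a single-process illustration of the programming model only, not Hadoop's API; `map_reduce`, `word_mapper` and `sum_reducer` are hypothetical names.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map-reduce job over (key, value) records in one process."""
    groups = defaultdict(list)
    for key, value in records:
        # Map: each input pair yields zero or more intermediate pairs.
        for out_key, out_value in mapper(key, value):
            # Shuffle: group intermediate values by their output key.
            groups[out_key].append(out_value)
    # Reduce: applied independently per group (in parallel on a real cluster).
    return {k: reducer(k, values) for k, values in groups.items()}


def word_mapper(_offset, line):
    for word in line.split():
        yield word, 1


def sum_reducer(_word, counts):
    return sum(counts)


splits = [(0, "fpga hadoop fpga"), (1, "hadoop cluster")]
counts = map_reduce(splits, word_mapper, sum_reducer)
print(counts)
```

Because each group is reduced independently, the reduce calls can be spread over nodes without coordination.
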
"there have been various attempts to accelerate deep learning algorithms using either hadoop or FPGA technology, but not both at the same time"

# chun95 3. design and implementation of a multicomputer interconnection network using FPGAs
- four-by-four port interconnection network
- cell routing hubs
- traditional FPGA benefits in system development including hardware/software codesign, architecture trade-off study and system debugging
  - benefits compared to ASIC development

# chun15 4. FPGA-based accelerator platform for big data matrix processing
- hadoop
- cluster of FPGA evaluation boards (EVBs)
- communication via Gigabit Ethernet switch
- 512x512 floating point matrix multiplications with four FPGA EVBs at 125MHz clock achieve 4x speedup as compared with i7-4770 CPU at 3.4GHz
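
These notes do not record how the matrix work is divided among the four boards. One plausible scheme, shown here purely as an assumption, is row-block partitioning: each board multiplies a contiguous block of rows of A by all of B, and the host concatenates the results. The sketch runs the "boards" sequentially on a small matrix.

```python
def matmul(a, b):
    """Plain product of two row-major matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def split_rows(a, workers):
    """Split the rows of a into `workers` contiguous blocks of near-equal size."""
    base, extra = divmod(len(a), workers)
    blocks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        blocks.append(a[start:start + size])
        start += size
    return blocks


def distributed_matmul(a, b, workers=4):
    # Each block would be sent to one board; here they are processed in turn.
    partial = [matmul(block, b) for block in split_rows(a, workers)]
    # Concatenating the row blocks in order rebuilds the full product.
    return [row for block in partial for row in block]


n = 8  # small stand-in for the 512x512 case
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float((i * j) % 3) for j in range(n)] for i in range(n)]
same = distributed_matmul(a, b) == matmul(a, b)
print(same)
```
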

# chun17 5. hadoop cluster with FPGA-based hardware accelerators for K-means clustering algorithm
- hadoop
- 4x speedup compared to hadoop cluster without FPGA-based accelerators
  - compared to machine learning (Apache) Mahout libraries
- evaluation boards (EVBs)
- Gigabit ethernet switch
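
K-means fits the map-reduce model naturally: the map step assigns each point to its nearest centroid and the reduce step averages the points per centroid. The sketch below shows one such iteration in plain Python. It is a generic formulation, not the paper's implementation, and the notes do not say which part the FPGA accelerates.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_step(points, centroids):
    # Map: emit (centroid index, point) for every point.
    groups = defaultdict(list)
    for point in points:
        groups[nearest(point, centroids)].append(point)
    # Reduce: each new centroid is the mean of the points assigned to it.
    updated = list(centroids)
    for idx, members in groups.items():
        updated[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
print(kmeans_step(points, centroids))
```

A centroid that attracts no points keeps its previous position in this sketch.
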

# du19 6. the library for hadoop deflate compression based on FPGA accelerator
- hadoop: map-reduce
- accelerating a hadoop system with hardware by implementing compression options with FPGAs
- speedup ratio 6.42x, 6.28x and 3.25x
- "Apache Hadoop is the industry's mainstream big data processing software, running on a cluster, distributed storage and distributed processing of large-scale data"
- compression and decompression at the same time due to timeline parallelism
- modified zlib, modified zpipe, modified testDFSIO (IO benchmarking)
- PCI-x4 hardware interface
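
For reference, the software path that such an accelerator replaces looks roughly like the following zpipe-style streaming round trip. It uses Python's standard `zlib` bindings, not the paper's modified zlib or its FPGA interface; the chunk contents are made up.

```python
import zlib


def deflate_stream(chunks, level=6):
    """Compress an iterable of byte chunks incrementally."""
    comp = zlib.compressobj(level)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    # Flush whatever the compressor is still buffering.
    yield comp.flush()


def inflate_stream(chunks):
    """Decompress an iterable of compressed byte chunks incrementally."""
    decomp = zlib.decompressobj()
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail


data = [b"hadoop " * 1000, b"fpga " * 1000]
compressed = b"".join(deflate_stream(data))
restored = b"".join(inflate_stream([compressed]))
print(len(b"".join(data)), len(compressed))
```
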

# kalm16 7. clustering and mapping algorithm for application distribution on a scalable FPGA cluster
"the creation of an FPGA cluster introduces two major challenges."
- "how does the communication between the FPGAs take place?"
- "how will the application(s) be distributed among the FPGAs?"
- task interaction graph (TIG) mapped to board connection graph (BCG)
- "One challenge for FPGA clusters is the topology and interconnection type. The most common approaches connect FPGA boards via Ethernet or connect FPGA cards via PCIe"
  - or self built using bluelink (lightweight pluggable interconnect library)
  - or wifi
- many approaches scale poorly and become inefficient when building a large cluster
- "Another challenge for FPGA clusters is the application distribution."
  - several publications with single FPGA
  - network-on-chip using single FPGA
  - also called load-balancing

# jone99 8. AUX implementing an API for distributed adaptive computing systems
- two such application classes are embedded systems in which multiple boards are required to physically interface to different sensors/actuators, and applications whose computational demands require multiple boards
- the cluster computing paradigm is a cost-effective method of constructing small parallel computers using commercial off-the-shelf technology to exploit coarse- and medium-grain parallelism

# knod13 9. integration of a highly scalable multi-FPGA-based hardware accelerator in common cluster infrastructures
- offers simple/scalable integration of FPGAs in common cluster architecture, permit easy access to resources
- enables system-wide dynamic partitioning, batch-based administration and monitoring of FPGA resources
- "numerous applications in bio- and neuroinformatics are highly suitable for FPGAs"
- present efficiently working cluster architecture with distributed FPGAs
- "If no hardware is available, a simulation of the applications' new runtime environment is required. A testbed provides realistic performance values but causes additional effort whereas a simulation can have negative effects on the performance values."
- "The currently used connection technologies range from simple streaming solutions realized with Gigabit Ethernet up to complex PCI Express solutions."
  - accelerator card developed by Pico Computing, holding up to six FPGAs per card
  - simple and user-friendly approach Convey-HC1 system
    - unlike in-socket FPGA co-processors from Nallatech it uses mezzanine connector to link the front side bus to an accelerator board with four user-programmable FPGAs
    - accessible with an OpenMP programming model
- emulation of parallel architectures
- "In most heterogeneous clusters FPGAs are used as simple co-processors or accelerators connected over PCIe to the node's processor cores."
- tight coupling to host processor
- "The main characteristic of our concept is the dynamic allocation of FPGA resources and the flexible assignment between the number of host processors and FPGAs"
- "Our approach comprises basic protocol implementations for the FPGA-to-FPGA, FPGA-to-Host and FPGA-to-Cluster communication"
- PCI communication
- "To allow an efficient programming of the FPGA resources in the introduced heterogeneous cluster environment, a framework is necessary. The Open Computing Language (OpenCL) framework is commonly used for the development and execution of programs across heterogeneous platforms consisting of CPUs, GPUs and also FPGAs."

# nesh15 10. accelerating machine-learning kernels in hadoop using FPGAs
- hadoop: map-reduce
- comprehensive analysis of communication and computation overheads such as data I/O movements, and calling several standard libraries that can not be offloaded to the accelerator
- several data mining algorithms and applications
  - K-means clustering, KNN classification, SVM-learn and Naive Bayes classification
- speedup derivation using Amdahl's law
- "As results shown, the input data size have a significant effect on the speedup in some applications."
- design space analysis: "Mapping of applications to a heterogeneous architecture to benefit from the diverse core and accelerators is a complex problem, particularly because different phases of the same application will often prefer different cores or configurations and thus, require specific scheduling and mapping to find the best match. Making wrong scheduling decisions can lead to suboptimal performance and negatively impact power and energy consumption as well"
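
One way to see why input size can matter is a simple cost model in which offloading pays a fixed setup cost plus a per-byte transfer cost, while some host-side work is never offloaded. The model and every parameter value below are assumptions for illustration, not measurements or formulas from the paper.

```python
def offload_speedup(n_bytes, cpu_rate, accel_rate, link_rate, setup_s, host_s_per_byte):
    """Speedup of an offloaded kernel versus a software-only run.

    Rates are in bytes per second; all parameters are hypothetical.
    """
    host = n_bytes * host_s_per_byte        # work that stays on the CPU
    cpu_total = host + n_bytes / cpu_rate   # software-only run
    accel_total = (
        host
        + setup_s                 # fixed cost of invoking the accelerator
        + n_bytes / link_rate     # moving the data to and from the accelerator
        + n_bytes / accel_rate    # accelerated kernel time
    )
    return cpu_total / accel_total


for size in (1e4, 1e6, 1e8):
    print(size, offload_speedup(size, 1e7, 1e9, 1e8, 1e-2, 1e-9))
```

With these made-up numbers the fixed setup cost dominates for small inputs, so offloading is slower than the CPU until the input is large enough to amortize it.
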

# theo14 11. interconnect for commodity FPGA clusters: standardized or customized?
- "Whilst soft cores for standard protocols (Ethernet, RapidIO, Infiniband, Interlaken) are a boon for FPGA-to-other-system interconnect, we argue that they are inefficient and unnecessary for FPGA-to-FPGA interconnect"
- makers of BlueLink
- on the idea of single PCB multi-FPGA and its pitfalls
  - "This requires complex design and simulation - for professional designers a board takes about one man-year of design effort. FPGAs are typically found in advanced ball grid array packages, which also makes manufacturing difficult. In addition there is the headache of managing the whole process of parts procurement, production, test and debug."
  - "Secondly, many such boards (especially commercial products) are not regular - each FPGA is not connected to the same peripherals. This requires a separate synthesis run for each FPGA in a cluster, which makes it difficult to scale to a large number of FPGAs."
- "some applications do not require any communication between FPGAs: loosely coupled"
  - "mapreduce fits this model"
- "other applications are tightly coupled"
  - gate-level system-on-chip simulation
- interconnect:
  - simple approach: GPIO pins using single-ended driving or low-voltage differential signalling (LVDS)
    - frequency limited to about 1 GHz in LVDS mode
    - signal integrity and skew: short cables (typically centimetres) with careful (expensive) construction
      - limits size of cluster
  - proposes:
    - commodity FPGA boards (reduce costs, development time)
    - serial interconnect using FPGA transceivers
    - low-cost commodity passive copper cabling between boards (optical for longer distances if necessary)
    - multi-hop routing such that fully-connected network is not required
- compared against Altera 10G Ethernet MAC on Stratix V platform
- small message sizes, low latency, reliable, hardware-only, lightweight, ubiquitous and interoperable
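
Multi-hop routing is what lets the cluster avoid a fully-connected network: each board only needs to know which neighbour to forward to. The sketch below builds such a next-hop table with breadth-first search over a hypothetical six-board ring. It illustrates the idea only and is not BlueLink's routing implementation.

```python
from collections import deque


def next_hop_table(links, source):
    """For each reachable destination, the neighbour of source on a shortest path."""
    table = {}
    queue = deque()
    for neighbour in links[source]:
        table[neighbour] = neighbour
        queue.append(neighbour)
    visited = {source, *links[source]}
    while queue:
        node = queue.popleft()
        for nxt in links[node]:
            if nxt not in visited:
                visited.add(nxt)
                table[nxt] = table[node]  # inherit the first hop used to reach node
                queue.append(nxt)
    return table


def route(links, source, dest):
    """Follow next-hop tables board by board until dest is reached."""
    path = [source]
    while path[-1] != dest:
        path.append(next_hop_table(links, path[-1])[dest])
    return path


# Hypothetical ring: each board has two serial links.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(route(ring, 0, 2))
```
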

# asse21 12. accelerating deep neuroevolution on distributed FPGAs for reinforcement learning problems
- sequential nature of problems poses a fundamental challenge
- "most appealing part of video games for reinforcement learning research is the availability of the game score as a direct reward signal, as well as the low cost of running large amounts of virtual experiments on computers without actual consequences"
- "training neural networks with derivative-free methods opens the door for innovations in hardware beyond GPUs"
- IBM Neural Computer
- "rather than accelerating the optimization algorithm (e.g. RL or GA) we have taken a different approach and addressed the data generation (i.e. Atari game environment and obtaining frames)"
- "within each node is a zynq-7045 system-on-chip, which integrates a dual-core Cortex A9 ARM processor and an FPGA, alongside 1GB of DRAM used both by the ARM CPU and the FPGA"
  - "for example, for game playing, a significant portion of the time is spent during the game itself, which results in a long sequence of inference of game frames and actions. communicating game scores and updating neural network weights are sparse in comparison. therefore, rather than accelerating the genetic algorithm, acceleration of the game environment and the inference can make a big difference as our results have shown."

# prit20 13. overview of the IBM neural computer architecture
- IBM neural computer (INC)
  - custom-designed distributed FPGA system developed by IBM Research
  - 416 FPGAs
  - 832 instances in parallel
- "while INC is a distributed system in that it is composed of distinct processor+memory nodes interconnected by communication links, it has a unique combination of features not available elsewhere. It is not a multi-FPGA 'sea of gates' system, whose structure would need to be defined by the logic resident on the FPGA. it has a very well defined structure of compute nodes with a well defined communications network. Therefore it does not carry the performance compromise associated with the need to support a fully-generic interconnect."
- 3x3x3 topology per card, along with (XYZ) co-ordinates overlaid to indicate the organization of the 3D mesh
  - 27 nodes placed on the card in a way to minimize the connection lengths between logically adjacent nodes
  - 27 identical nodes except for ethernet (node 100), controller (node 000) with 4 lane PCIe 2.0 connection (possibly to host PC) and serial (possibly serving as console during boot time), extra optional PCIe support (node 200) for additional bandwidth
- backplane up to 16 cards in a 12x12x3 mesh
- communication network currently supports directed and broadcast packet routing schemes
  - multiple virtual channels can be designed to sit atop the underlying router logic described in the previous section to give the processor and FPGA logic different virtual or logical interfaces to the communication network
    - internal ethernet
      - emulates regular ethernet to take advantage of existing software
    - postmaster direct memory access (DMA)
    - bridge FIFO
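
The mesh dimensions above can be checked with a short coordinate sketch. It assumes node labels such as 100 read as (X, Y, Z) digits and a mesh without wrap-around links. The dimension-order route is a common scheme used here for illustration; these notes do not say which routing algorithm INC uses.

```python
CARD_DIM = (3, 3, 3)    # nodes per card along X, Y, Z
MESH_DIM = (12, 12, 3)  # full backplane of 16 cards


def node_count(dim):
    x, y, z = dim
    return x * y * z


def neighbours(node, dim=MESH_DIM):
    """Mesh neighbours of an (x, y, z) node: one step along each axis, no wrap-around."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] += step
            if 0 <= coord[axis] < dim[axis]:
                result.append(tuple(coord))
    return result


def dimension_order_route(src, dst):
    """Route by correcting X first, then Y, then Z."""
    path, current = [src], list(src)
    for axis in range(3):
        while current[axis] != dst[axis]:
            current[axis] += 1 if dst[axis] > current[axis] else -1
            path.append(tuple(current))
    return path


print(node_count(CARD_DIM), node_count(MESH_DIM))
print(dimension_order_route((0, 0, 0), (2, 1, 0)))
```

Sixteen cards of 27 nodes give 432 mesh positions, which matches the 12x12x3 backplane.
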

# 14. inference

# 15. the cube
